Feature selection on a dataset of protein families: from exploratory data analysis to statistical variable importance
Authors
Abstract
Proteins are characterized by several types of features (structural, geometrical, energetic). Most of these features are expected to be similar within a protein family. We are interested in detecting which features can identify the proteins that belong to a family, as well as in defining the boundaries among families. Some features are redundant: they can generate noise when identifying which variables are essential as a fingerprint and, consequently, whether or not they are related to a function of a protein family. We defined an original approach to analyzing protein features in order to define their relationships and peculiarities within protein families. A multistep approach was carried out mainly in the R environment: getting and cleaning data, exploratory data analysis, and predictive modeling for classification. Ten protein families were chosen by their CATH classification (different architectures), with rules on the number of structures, the length of the sequence, and the choice of the chain. The properties investigated are secondary structures, hydrogen bonds, accessible surface areas, torsion angles, packing defects, number of charged residues, free energy of folding, volume, and salt bridges. Kernel density estimation helps in discovering unusual multimodal profiles. Pearson's correlation highlights statistical links between pairs of variables, and Pearson's distance provides a dendrogram that clusters the features. PCA clusters the protein families and detects outliers, while sparse PCA performs feature selection. Several classification algorithms were used: decision trees (classical, with boosting and bagging), SVMs (flexible discriminant analysis), and nearest shrunken centroids. The focus is on variable importance estimation. A 10-fold cross-validation, repeated 10 times, was applied to the training set. Accuracy, the kappa coefficient, sensitivity, and specificity were calculated for each method. From the density plots, the percentage of mostly buried residues is significantly different for each family. The dissimilarity dendrogram shows separate clusters for secondary structures, torsion angles, packing defects, and geometrical features. In the feature network, torsion angles and surface variables appear peripheral to the core of the graph (i.e., redundant). The PCA biplot gives a good clustering of the protein families, and sparse PCA confirms the dendrogram results. Unifying all the results, the following features are typical for our dataset: helix, strand, coil, turn, hydrogen bonds, polar and charged accessible surface area, volume, and mostly buried residues. The random forest algorithm has the best performance values. Graphical multivariate procedures are good tools for characterizing possible fingerprints of the protein families. Predictive models for classification and variable importance estimation help in performing feature selection. The work could be improved by the use of multivariate regression models and by increasing the number of protein families.
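As a companion to the abstract, the sketch below outlines the described multistep workflow in R (the environment named above). It is a minimal illustration under assumptions, not the authors' code: the file name protein_features.csv, the column names family and buried_fraction, and the use of the caret and randomForest packages are hypothetical placeholders for the kernel density, Pearson-distance dendrogram, PCA, and repeated cross-validation steps.

    ## Minimal R sketch of the multistep workflow (file name and column
    ## names are hypothetical placeholders, not the authors' data).
    library(caret)          # repeated cross-validation, variable importance
    library(randomForest)   # random forest classifier used through caret

    feat <- read.csv("protein_features.csv")   # one row per protein chain
    feat$family <- factor(feat$family)

    ## Exploratory data analysis -----------------------------------------
    plot(density(feat$buried_fraction, na.rm = TRUE),
         main = "Kernel density: mostly buried residues")

    num <- feat[, sapply(feat, is.numeric)]    # numeric features only
    rho <- cor(num, method = "pearson")        # pairwise Pearson correlation
    hc  <- hclust(as.dist(1 - rho))            # Pearson distance = 1 - r
    plot(hc, main = "Feature dissimilarity dendrogram")

    pc <- prcomp(num, center = TRUE, scale. = TRUE)
    biplot(pc)                                 # family clustering and outliers

    ## Classification with 10-fold x 10 cross-validation ------------------
    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
    set.seed(1)
    rf_fit <- train(family ~ ., data = feat, method = "rf", trControl = ctrl)

    rf_fit$results   # accuracy and kappa across the resamples
    varImp(rf_fit)   # variable importance ranking

Per-family sensitivity and specificity can then be obtained with caret's confusionMatrix() on held-out predictions, and the other classifiers mentioned above (boosted and bagged trees, flexible discriminant analysis, nearest shrunken centroids) can be compared by changing the method argument of train().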
Similar articles
Feature Selection in Structural Health Monitoring Big Data Using a Meta-Heuristic Optimization Algorithm
This paper focuses on the processing of structural health monitoring (SHM) big data. Extracted features of a structure are reduced using an optimization algorithm to find a minimal subset of salient features by removing noisy, irrelevant and redundant data. The PSO-Harmony algorithm is introduced for feature selection to enhance the capability of the proposed method for processing the measure...
A New Hybrid Feature Subset Selection Algorithm for the Analysis of Ovarian Cancer Data Using Laser Mass Spectrum
Introduction: A major problem in the treatment of cancer is the lack of an appropriate method for the early diagnosis of the disease. The chemical reaction within an organ may be reflected in the form of proteomic patterns in the serum, sputum, or urine. Laser mass spectrometry is a valuable tool for extracting the proteomic patterns from biological samples. A major challenge in extracting such ...
An Overview of the New Feature Selection Methods in Finite Mixture of Regression Models
Variable (feature) selection has attracted much attention in contemporary statistical learning and recent scientific research. This is mainly due to the rapid advancement in modern technology that allows scientists to collect data of unprecedented size and complexity. One type of statistical problem in such applications is concerned with modeling an output variable as a function of a sma...
Developing a Filter-Wrapper Feature Selection Method and its Application in Dimension Reduction of Gen Expression
Nowadays, the increasing volume of data and number of attributes in datasets have reduced the accuracy of learning algorithms and increased the computational complexity. Feature selection is a dimensionality reduction method, carried out through filter and wrapper approaches. The wrapper methods are more accurate than the filter ones but perform faster and have a lower computational burden. With ...
A Case Study of Planning for Exploratory Data Analysis
To develop an initial understanding of complex data, one often begins with exploration. Exploratory data analysis (EDA) provides a set of statistical tools through which patterns in data may be extracted and examined in detail. We briefly describe an implemented planning representation for Aide, an automated EDA assistant. We discuss the analysis of a small dataset, in which exploration is drive...
Journal: PeerJ PrePrints
Volume: 4
Pages: -
Publication year: 2016